Attention and feature transfer based knowledge distillation

Existing knowledge distillation (KD) methods are mainly based on features, logits, or attention, where features and logits represent the results of reasoning at different stages of a convolutional neural network, and attention maps represent the reasoning process. Because the reasoning process and the reasoning results are continuous in time, transferring only one of them to the student network leads to unsatisfactory results. We study knowledge transfer between teacher and student networks to different degrees, revealing the importance of simultaneously transferring knowledge related to the reasoning process and the reasoning results to the student network, which provides a new perspective for the study of KD. On this basis, we propose a knowledge distillation method based on attention and feature transfer (AFT-KD). First, we use transformation structures to transform intermediate features into attention and feature blocks (AFBs) that contain both inference process information and inference outcome information, and force the student to learn the knowledge in the AFBs. To save computation in the learning process, we use block operations to align the teacher and student networks. In addition, to balance the decay rates of the different losses, we design an adaptive loss function based on the loss optimization rate. Experiments show that AFT-KD achieves state-of-the-art performance on multiple benchmarks.

transfers the intermediate features and their corresponding attention maps to the student network at the same time to achieve better performance. Our work mainly consists of two parts. The first part is about how to obtain inference information. First, we find that the approach of 19 is extensible: the operation of generating a class activation map (CAM) by projecting the weights of the output layer back onto the convolutional feature map can be easily extended to other convolutional layers. Inspired by 30, we use 1 × 1 point convolution to generate the attention map corresponding to an intermediate feature map and, after binarization, superimpose it on the original feature map (Fig. 2); we call the result the attention and feature block (AFB). Then, following the structure of the CNN, we divide it into stages that simulate different reasoning moments, and force the student network to approach the AFBs of all stages. The advantage is that the student can learn the complete reasoning process, while the block operation reduces the required computation. The second part is about how to balance the loss function. We refer to the error generated by approximating the AFBs as the KD loss and the error between the predicted output and the ground-truth label as the cross-entropy loss (CE loss). To balance the optimization rates of the KD loss and the CE loss and to prevent the loss of accuracy caused by continued training after one loss has converged, we design an adaptive loss function that adjusts the loss weights using the ratio of each loss's decay rate to the average of the two decay rates.
Overall, our contributions are summarized as follows:
Finally, we arrange the content of the article in five sections. The second section introduces related work on knowledge distillation according to the traditional classification, and analyzes the connections between the methods and their deficiencies. The third section introduces the proposed method in detail, including a review of CAM, the theoretical analysis of AFT-KD, and the implementation of the adaptive loss. The fourth section contains all the experiments: we first introduce the datasets used, then analyze the influence of the information contained in the AFB on distillation performance, and further verify the performance advantage of learning AFBs. Finally, we verify and analyze the actual performance of the adaptive loss. In the last section, we summarize the proposed method, analyze its limitations, and outline what we will do next.

Related work
The concept of knowledge distillation (KD) was proposed by Hinton et al. 21, which forces the student network to extract knowledge from the soft labels provided by the teacher and from the ground-truth labels. To make full use of the "dark knowledge" contained in soft labels, the concept of temperature was introduced. Existing KD methods can be mainly divided into three types: logit-based 20,21,31–34, feature-based 18,22–29,35, and attention-map-based 19,30.
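The effect of temperature on soft labels can be illustrated with a short sketch. It uses the usual temperature-scaled softmax; the example logits and the function name are illustrative choices, not values taken from the experiments in this article.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax; a larger temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative teacher logits for a 3-class problem (made-up numbers).
logits = [5.0, 2.0, 0.5]
hard = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)
```

With the higher temperature, the non-target classes receive noticeably more probability mass, which is the information a student is meant to pick up from soft labels.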
Logit distillation transfers the knowledge implicit in the output logits of the teacher model to the student network. BAN 32 obtains performance superior to the teacher model by training a student network with the same parameterization as the teacher. DKD 20 reformulates the KD loss into target-class knowledge distillation (TCKD) and non-target-class knowledge distillation (NCKD), revealing that the coupled formulation of KD limits the effectiveness and flexibility of knowledge transfer. CrossKD 34 passes intermediate features of the student network to the teacher's detection head, producing cross predictions that are then forced to mimic the teacher's predictions. In addition, there are several other works on logit distillation 21,33,34.
Feature-based KD methods tend to perform better, forcing the student to extract valid content from the intermediate features of the teacher network, at the cost of more computation than logit distillation. RKD 25 uses the relations among multiple data examples to guide the student and penalizes differences between the teacher's and the student's relations, similar to studies that transfer sample correlations from the teacher network to the student network 26,27. PKT 35 models the teacher's knowledge as a probability distribution and uses KL divergence to measure distance. CRD 22 combines contrastive learning with knowledge distillation and uses a contrastive objective for knowledge transfer. ReviewKD 18 uses cross-layer connection paths to integrate the knowledge implied by features at different levels.
Attention-map-based KD methods instruct the student network what information to pay attention to during reasoning. AT 30 verifies the validity of transferring attention maps, using class activation maps to transfer knowledge to the student network. CAT-KD 19 reveals that the ability to distinguish class regions is the key to network classification, and proves that this ability can be acquired and enhanced by transferring CAMs. CAT-KD transfers knowledge by using a transformation structure to obtain attention, which makes attention-based knowledge distillation competitive.
KD methods based on transferring logits and features perform well, while KD methods based on transferring attention maps are highly interpretable. Previous studies have ignored the link between these two kinds of knowledge. In this paper, we aim to solve this problem by proposing a KD approach based on attention and feature transfer, which achieves leading results on several benchmarks.

Our method
In this section, we first review CAM and analyze its extensibility, then propose the attention and feature block (AFB) and apply it to knowledge distillation, and finally propose the adaptive loss function, which combines the decay rates of the KD loss and the CE loss.

Review the CAM
First, we consider a commonly used CNN structure whose last convolutional layer outputs the feature $A \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H$ and $W$ are the height and width of the feature, respectively. $A_i \in \mathbb{R}^{H \times W}$ denotes the feature of the $i$-th channel, and $A_i(x, y)$ denotes the activation at spatial location $(x, y)$ on channel $i$. The process by which a regular CNN generates its prediction can then be expressed as follows:

$$P_j = \sum_{i=1}^{C} W_i^j \cdot \frac{1}{H \times W} \sum_{x, y} A_i(x, y) \tag{1}$$

where $P_j$ is the logit of the $j$-th category, and $W_i^j$ is the weight corresponding to the $j$-th category in the fully connected layer. In 30, the class activation map (CAM) is generated by projecting the weights of the fully connected layer of the CNN back onto the convolutional feature map, so the CAM for the $j$-th category is calculated as:

$$\mathrm{CAM}_j(x, y) = \sum_{i=1}^{C} W_i^j A_i(x, y) \tag{2}$$

Equation (2) can also be rewritten as:

$$\mathrm{CAM}_j = \sum_{i=1}^{C} W_i^j A_i \tag{3}$$

According to Eqs. (1) and (3), the relationship between the logit $P_j$ and $\mathrm{CAM}_j$ is as follows:

$$P_j = \frac{1}{H \times W} \sum_{x, y} \mathrm{CAM}_j(x, y) \tag{4}$$

As shown in Eq. (4), the logit of class $j$ can be obtained by averaging the CAM of that class. This kind of operation, in which further inference results are obtained from a feature map, is very common in CNNs. For example, the feature map produced by a convolution kernel is normalized and activated to obtain the output of that layer. Therefore, we can easily generalize the CAM calculation to other locations in the CNN, and the generated CAM-like map corresponds to the output feature map of that layer.
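The relation between the class logit and the spatial average of the class activation map can be checked numerically. The sketch below is only an illustration: it assumes a fully connected layer without bias applied after global average pooling, and the array sizes and function name are arbitrary choices.

```python
import numpy as np

def cam_and_logits(features, fc_weights):
    """features: (C, H, W) output of the last conv layer.
    fc_weights: (num_classes, C) weights of the fully connected layer (no bias assumed).
    Returns the per-class CAMs and the logits computed in two ways."""
    # Logits the usual way: global average pooling, then the linear layer.
    pooled = features.mean(axis=(1, 2))                      # (C,)
    logits_from_pooling = fc_weights @ pooled                # (num_classes,)
    # CAM for class j: weighted sum of channel maps with the class-j weights.
    cams = np.einsum("jc,chw->jhw", fc_weights, features)    # (num_classes, H, W)
    # Logits again, this time as the spatial mean of each CAM.
    logits_from_cam = cams.mean(axis=(1, 2))
    return cams, logits_from_pooling, logits_from_cam

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4, 4))    # toy feature map: C=8, H=W=4
W = rng.standard_normal((5, 8))       # toy classifier: 5 classes
cams, p_gap, p_cam = cam_and_logits(A, W)
assert np.allclose(p_gap, p_cam)      # both routes give the same logits
```

Because averaging and the weighted channel sum are both linear, swapping their order does not change the result, which is why the CAM can be read as a spatially resolved version of the logit.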

AFT-KD
To gain knowledge related to the inference process, we extend CAM to all convolutional layers of the CNN. First, we formulate the general operation of a convolutional layer. Assume that the CNN contains $L$ convolutional layers, where the input feature of the $k$-th convolutional layer is $F_{k-1} \in \mathbb{R}^{C \times H \times W}$, and $C$, $H$, and $W$ are the number of channels, height, and width of the input feature, respectively. Then the input feature of the $(k+1)$-th convolutional layer is:

$$F_k = \mathrm{ACT}_k(\mathrm{Pre\_F}_k), \qquad \mathrm{Pre\_F}_k = \mathrm{Norm}_k(\mathrm{Conv}_k(F_{k-1})) \tag{5}$$

where $\mathrm{ACT}_k(\cdot)$, $\mathrm{Norm}_k(\cdot)$, and $\mathrm{Conv}_k(\cdot)$ are the activation function, normalization function, and convolution operation of the $k$-th layer, respectively, and $\mathrm{Pre\_F}_k$ is the pre-activation output feature. We approximate the activation of the pre-activation output feature as the mapping from the inference process to the inference result. To extract knowledge about the inference process in $F_k$, we use the pre-activation output feature $\mathrm{Pre\_F}_k$ to compute the CAM-like map corresponding to $F_k$. Inspired by 18, we use point convolution to replace the weights of the fully connected layer, so that Eq. (3) can be rewritten as Eq. (6), where $\mathrm{conv}_j(\cdot)$ is the convolution kernel of the $j$-th channel of the output feature. Combined with Eq. (6), the attention map corresponding to $F_k$ can be expressed as Eq. (7), where $\mathrm{CAML}_k$ is the CAM-like map corresponding to $F_k$ and $\mathrm{Bin}(\cdot)$ denotes binarization; binarization further highlights what the convolution attends to when reasoning. Finally, we superimpose $\mathrm{CAML}_k$ on $F_k$ to obtain an output feature that integrates the inference process and the inference result, which we call the attention and feature block (AFB). The AFB of the $k$-th convolutional layer is computed as in Eq. (8). On this basis, we propose AFT-KD, which forces the student to imitate the knowledge in the AFBs transferred by the teacher. To save computation and align the student and teacher networks, we divide the CNN into $N$ inference stages, and the AFT loss is defined as in Eq. (9), where $C_k$ is the number of channels of the output feature map of stage $k$ after adjustment, $\varphi(\cdot)$ is the adjustment function used to unify the number of channels and the resolution of $\mathrm{AFB}_k$ and $F_k$, and $F_k$ here is the output feature map of stage $k$ of the student network.
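A minimal NumPy sketch of how an AFB-style block and a stage-wise transfer loss could be computed is given here. It is not the authors' implementation: the per-channel mean threshold used for binarization, the element-wise product used for superposition, the ReLU activation, the mean-squared-error loss, and the omission of the adjustment function (shapes are assumed to match already) are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np

def point_conv(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def binarize(x):
    """ASSUMPTION: threshold each channel at its own spatial mean."""
    thresh = x.mean(axis=(1, 2), keepdims=True)
    return (x > thresh).astype(x.dtype)

def attention_feature_block(pre_f, weight):
    """Build an AFB-style block from a pre-activation feature (C, H, W)."""
    feature = np.maximum(pre_f, 0.0)                   # ASSUMPTION: ReLU activation
    attention = binarize(point_conv(pre_f, weight))    # binarized CAM-like map
    return attention * feature                         # ASSUMPTION: superpose by element-wise product

def aft_loss(teacher_blocks, student_features):
    """ASSUMPTION: mean squared error per stage, averaged over the stages."""
    per_stage = [np.mean((t - s) ** 2) for t, s in zip(teacher_blocks, student_features)]
    return float(np.mean(per_stage))

rng = np.random.default_rng(0)
pre_f = rng.standard_normal((6, 8, 8))     # toy pre-activation feature of one stage
weight = rng.standard_normal((6, 6))       # toy 1x1 kernel keeping the channel count
afb = attention_feature_block(pre_f, weight)
```

In a real teacher-student pair the channel counts and resolutions generally differ between the two networks, so an adjustment step would be needed before the loss is evaluated.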

Adaptive loss
To balance the optimization rates of the different losses, we design an adaptive loss function based on the loss optimization rate. After determining the AFT loss, we define the overall loss of AFT-KD as:

$$L = \alpha L_{CE} + \beta L_{AFT} \tag{10}$$

where $L_{CE}$ is the standard cross-entropy loss, and $\alpha$ and $\beta$ are dynamic hyperparameters used to balance $L_{CE}$ and $L_{AFT}$. When the optimization rates of $L_{CE}$ and $L_{AFT}$ differ, one of the losses converges prematurely during training. In this case, continuing to train the network can reduce the loss that has not converged, but at the cost of the accuracy of the task that has already converged. To avoid this, we introduce the loss optimization rate to balance the decay rates of the two loss functions. First, we record the initial losses $L^0_{CE}$ and $L^0_{AFT}$ at the beginning of training, and the current losses $L_{CE}$ and $L_{AFT}$ at each iteration. We can then calculate the loss decay rates in this iteration as in Eq. (11), where Dr_CE and Dr_AFT are the decay rates of $L_{CE}$ and $L_{AFT}$ relative to their initial losses in this iteration. With the average decay rate Dr = (Dr_CE + Dr_AFT)/2, $\alpha$ and $\beta$ are given by Eq. (12). According to Eqs. (11) and (12), the task that is optimized faster receives a positive dynamic coefficient less than 1 in this round, and the faster the optimization, the smaller the coefficient, which slows its optimization in this round so that the other task can catch up. Experiments show that the adaptive loss effectively narrows the gap between Dr_CE and Dr_AFT.
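One way the dynamic coefficients could be computed is sketched here. The exact definitions are assumptions chosen to match the behaviour described in the text: the decay rate is taken as the current loss divided by the initial loss (so a faster-optimized loss has a smaller rate), and each coefficient is that rate divided by the average rate. The function name and the numbers are illustrative only.

```python
def adaptive_weights(ce_init, aft_init, ce_now, aft_now, eps=1e-12):
    """Return (alpha, beta) for the CE and AFT losses.

    ASSUMPTION: decay rate = current loss / initial loss (fraction remaining).
    ASSUMPTION: coefficient = own decay rate / average decay rate.
    """
    dr_ce = ce_now / (ce_init + eps)
    dr_aft = aft_now / (aft_init + eps)
    dr_mean = (dr_ce + dr_aft) / 2.0
    alpha = dr_ce / (dr_mean + eps)
    beta = dr_aft / (dr_mean + eps)
    return alpha, beta

# Toy numbers: CE has dropped to 25% of its initial value, AFT only to 75%.
alpha, beta = adaptive_weights(ce_init=4.0, aft_init=2.0, ce_now=1.0, aft_now=1.5)
total_loss = alpha * 1.0 + beta * 1.5
```

Under these assumptions the faster-decaying CE loss gets a coefficient below 1 and the slower AFT loss a coefficient above 1, so the next update puts more weight on the loss that is lagging.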

Datasets
Our experiments are mainly conducted on two image classification datasets: (1) CIFAR-100 36 contains a total of 60,000 images of 32 × 32 pixels in 100 categories, of which the training set and the validation set contain 50,000 and 10,000 images, respectively. (2) ImageNet 37 is a large-scale dataset with 1000 classes, including 1.2 million training images and 50,000 validation images.

Implementation details
Our experimental setup on CIFAR-100 and ImageNet strictly follows 19,20. For the experiments on CIFAR-100, we train for 240 epochs using the SGD optimizer with the batch size set to 64. The initial learning rate of 0.05 (0.01 for ShuffleNet 38,39 and MobileNet 17) is divided by 10 at epochs 150, 180, and 210. For the experiments on ImageNet, we train for 100 epochs with a batch size of 512. The initial learning rate is 0.2 and decays to one-tenth of its value every 30 epochs. In addition, we conduct experiments on various representative CNNs: VGG 40, ResNet 41, WideResNet 42, MobileNet 17, and ShuffleNet 38,39. Table 1 provides a brief overview of these networks.
To ensure the fairness of the experimental results, the results of existing methods are either those reported in their articles 18–20 or obtained using the code they provide with exactly the same settings. All results on CIFAR-100 are the average of 5 trials, while the results on ImageNet are the average of 3 trials.
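The two step schedules described in this section can be written as a small helper. This is a sketch rather than the training code used in the experiments: it assumes 0-indexed epochs, treats the CIFAR-100 milestones 150, 180, and 210 as epochs, and takes the ImageNet drops to occur at epochs 30, 60, and 90; the function names are hypothetical.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate at a 0-indexed epoch: multiply by gamma at each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

def cifar_lr(epoch, base_lr=0.05):
    """CIFAR-100 schedule: 240 epochs, divide by 10 at epochs 150, 180, and 210."""
    return step_lr(epoch, base_lr, milestones=(150, 180, 210))

def imagenet_lr(epoch, base_lr=0.2):
    """ImageNet schedule: 100 epochs, decay to one-tenth every 30 epochs."""
    return step_lr(epoch, base_lr, milestones=(30, 60, 90))

# Pass base_lr=0.01 for the ShuffleNet and MobileNet students on CIFAR-100.
schedule = [cifar_lr(e) for e in range(240)]
```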

Exploration of AFB
In this section, we first explore the effectiveness of transferring AFB, and then propose that learning continuous reasoning knowledge has greater benefits for students.

VGG 40: Constructed by stacking convolutional layers followed by fully connected layers, uniformly using the ReLU activation function. Common models include VGG16 and VGG19.
ResNet 41: Introduces the residual structure to alleviate the vanishing/exploding gradient problem. Commonly used structures include ResNet34/50/101.
WideResNet 42: Increases the network width on the basis of ResNet to further improve the training speed of the network.

AFB contains more complete inference information than CAM and output features
Since the reasoning process is closely related to the reasoning outcome, transferring only the features associated with the reasoning outcome from the teacher can cause the student to imitate an incorrect reasoning process. The AFB contains complete inference information and should in theory benefit the student, a gain that can be directly observed through the student's performance on the classification task. We transferred different information from ResNet32 × 4 to ResNet8 × 4, including (1) only output features, (2) only CAM, and (3) AFB, and observed the performance on CIFAR-100. To align the teacher and student networks, we divide the reasoning process into three stages based on ResNet's layer grouping. As shown in Table 2, transferring CAM or output features alone already achieves good classification accuracy, but transferring AFB further improves performance, suggesting that AFB carries more complete inference knowledge than CAM or output features, and that this knowledge directly improves student performance. In addition, transferring AFBs at different locations brings different improvements to the performance of the student network.

Transferring continuous AFB is more beneficial to students
The forward propagation of data in the network is continuous in time. The process from the input image to the predicted logits can be divided into different stages, and the reasoning results of one stage are used as the reasoning inputs of the next. Therefore, the reasoning information (including the reasoning process and the reasoning results) of adjacent stages is also closely related. We demonstrate experimentally that delivering continuous and complete AFBs to the student leads to larger performance improvements.
Similarly, we use ResNet32 × 4 as the AFB producer and ResNet8 × 4 to learn the AFBs of different stages. We divide the experiment into three groups: (1) learn only the AFB generated by Layer1; (2) learn the AFBs generated by Layer1-2; (3) learn the AFBs generated by all three layers. Layer1-3 corresponds to Stage1-3 in Fig. 1, where N equals 3. As shown in Table 3, learning only part of the inference knowledge yields similar classification accuracy in each case (higher than the baseline network), but learning the full set of AFBs results in a substantial improvement in network performance.

Results on CIFAR-100
Similar to 19,20, we carried out experiments on CIFAR-100 with teacher-student pairs that share the same architecture and with pairs that use different architectures; the results are reported in Tables 4 and 5. Our approach sets a new SOTA with the same teacher-student architecture and in some experiments with different teacher-student architectures. Compared with the logit-based approaches, our approach achieves better results in all experiments with different architectures. However, when the student model is MobileNetV2, the performance of AFT-KD is slightly lower than that of 18,19, which we speculate is due to the large difference between the teacher and student architectures and the simplistic alignment we used. In experiments with the same architecture, the accuracy of the student models trained by our method exceeded that of the teachers. In particular, when the teacher model is ResNet32 × 4, the accuracy of AFT-KD reaches 77.14%, which is 4.64% higher than the teacher model.

Results on ImageNet
Tables 6 and 7 give the top-1 and top-5 accuracy of image classification on ImageNet. Our method does not achieve the best performance due to the teacher's ability, but it is still better than most KD methods.
Finally, we compare the performance of several SOTA methods on CIFAR-100 when the training set is reduced by different proportions, in line with the practice in 18, to assess their dependence on the amount of training data. As shown in Fig. 3, AFT-KD is least affected by the amount of training data, demonstrating the excellent distillation efficiency of our method.

Exploration of adaptive loss
From the analysis in section "Adaptive loss", we can see that an imbalance in the optimization rates of the loss functions in multi-task learning leads to decreased accuracy in the later period of training. The adaptive loss function automatically adjusts the loss weights according to the optimization rates of the AFT loss and the cross-entropy loss, which effectively alleviates this problem. The improvement can be measured by the distance between the loss decay rate curves of the tasks, and Fig. 4 compares the optimization rate curves before and after adopting the adaptive loss function. We sampled the raw optimization rate data and computed the mean, resulting in smoothed curves for comparison. It can be observed that, after the optimization rates of the two losses are balanced by the adaptive loss function, the distance between the loss decay curves decreases, and both losses are optimized in a more synchronized manner. This improvement also effectively enhances the model's classification performance.
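The smoothing and the distance between two decay-rate curves can be sketched as follows. The text does not specify the sampling window or the distance measure, so non-overlapping window means and the mean absolute gap are assumptions, and the numbers are made up for illustration.

```python
def window_means(values, window):
    """Smooth a curve by averaging consecutive, non-overlapping windows."""
    chunks = [values[i:i + window] for i in range(0, len(values), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]

def mean_gap(curve_a, curve_b):
    """ASSUMPTION: distance between two curves = mean absolute difference."""
    pairs = list(zip(curve_a, curve_b))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Toy decay-rate curves for the two losses (illustrative values only).
dr_ce = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
dr_aft = [1.0, 0.9, 0.8, 0.8, 0.7, 0.7]
gap = mean_gap(window_means(dr_ce, 2), window_means(dr_aft, 2))
```

A smaller gap after switching to the adaptive loss would indicate that the two losses are being optimized at more similar rates.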

Conclusion
We analyze the existing KD methods and reclassify them into those based on the inference process and those based on the inference result, which provides a new perspective for the study of knowledge distillation. Based on this, we propose AFT-KD with attention and feature transfer, which achieves competitive results on several commonly used benchmarks. Finally, to balance the loss optimization rates in AFT-KD, we propose an adaptive loss function based on the loss decay rate to further improve the performance of AFT-KD. However, in experiments in which the teacher and the student adopt different architectures, the performance of AFT-KD is not the best among all methods, which we believe is caused by the large difference between the teacher and student architectures and by the overly simple alignment method we used. This limitation can be addressed by designing alignment rules specific to the network structures. In addition, compared with feature-based KD methods, AFT-KD incurs some additional computational cost, which we also need to continue to explore in future work.

Figure 1. Illustration of AFT-KD. To align the teacher and student networks, we divide them into N stages of reasoning. The value of N varies across teacher-student structures. For example, N equals 3 in the experiment in section "Exploration of AFB".
https://doi.org/10.1038/s41598-023-43986-y www.nature.com/scientificreports/

Figure 3. Accuracy (%) of students trained with several SOTA methods on CIFAR-100. We set ResNet32 × 4 as the teacher and ResNet8 × 4 as the student, and the training set is reduced at various ratios.

Figure 4. Comparison of optimization rate curves before and after using the adaptive loss function.

Table 1. Several common neural networks are briefly introduced.

Table 4. Results on CIFAR-100. Teachers and students have different architectures. Significant values are in bold.

Table 5. Results on CIFAR-100. Teachers and students have the same architecture. Significant values are in bold.

Table 6. Results on ImageNet. We set ResNet34 as the teacher and ResNet18 as the student. Significant values are in bold.

Table 7. Results on ImageNet. We set ResNet50 as the teacher and MobileNet as the student. Significant values are in bold.